I INTRODUCTION


The purpose of statistics, like that of geometry or physics, is to
describe certain real phenomena. The objects of the real world can
never be described in such a complete and exact way that they could
form the basis of an exact theory. We have to replace them by some
idealized objects, defined explicitly or implicitly by a system of
axioms. For instance, in geometry we define the basic notions "point,"
"straight line," and "plane" implicitly by a system of axioms. They
take the place of empirical points, straight lines and planes which
are not capable of exact definition. In order to apply the theory to
real phenomena, we need some rules for establishing the correspondence
between the idealized objects of the theory and those of the real
world. These rules will always be somewhat vague and can never form a
part of the theory itself.

The purpose of statistics is to describe certain aspects of mass
phenomena and repetitive events. The fundamental notion used is that
of "probability." In the theory it is defined either explicitly or
implicitly by a system of axioms. For instance, Mises 1) defines the
probability of an event as the limit of the relative frequency of this
event in an infinite sequence of trials satisfying certain conditions.
This is an explicit definition of probability. Kolmogoroff 2) defines
probability as a set function which satisfies a certain system


1) See references 10 and 11 

2) See reference 9 
of axioms. These idealized mathematical definitions are related to the
applications of the theory by translating the statement "the event E
has the probability p" into the statement "the relative frequency of
the event E in a long sequence of trials is approximately equal to p."
This translation of a theoretical statement into an empirical
statement is necessarily somewhat vague, for we have said nothing
about the meanings of the words "long" or "approximately." But such
vagueness is always associated with the application of theory to real
phenomena.

It should be remarked that instead of the above translation of the
word "probability" it is satisfactory to use the following somewhat
simpler one: "The event E has a probability near to one" is translated
into "it is practically certain that the event E will occur in a
single trial." In fact, if an event E has the probability p then,
according to a theorem of Bernoulli, the probability that the relative
frequency of E in a sequence of trials will be in a small neighborhood
of p is arbitrarily near to 1 for a sufficiently long sequence of
trials. If we translate the expression "probability nearly 1" into
"practical certainty," we obtain the statement "it is practically
certain that the relative frequency of E in a long sequence of trials
will be in a small neighborhood of p."
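
Bernoulli's theorem can be checked numerically. The following is only a sketch, not part of the original argument: the function name `prob_freq_near_p`, the choice p = 0.5 and the neighborhood half-width 0.05 are assumptions made for illustration. It sums the exact binomial probabilities of all outcomes whose relative frequency lies within the neighborhood.

```python
import math


def prob_freq_near_p(n, p, eps):
    """Exact probability that the relative frequency k/n of an event with
    probability p lies within eps of p in n independent trials."""
    total = 0.0
    for k in range(n + 1):
        # The small tolerance guards against rounding noise at the boundary.
        if abs(k - n * p) <= n * eps + 1e-9:
            # Binomial probability computed on the log scale to avoid overflow.
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                       - math.lgamma(n - k + 1)
                       + k * math.log(p) + (n - k) * math.log(1.0 - p))
            total += math.exp(log_pmf)
    return total


# Hypothetical illustration: a fair coin and a neighborhood of half-width 0.05.
for n in (10, 100, 1000, 10000):
    print(n, round(prob_freq_near_p(n, 0.5, 0.05), 4))
```

The printed probabilities rise toward 1 as the number of trials grows, which is the sense in which "probability nearly 1" is read as "practical certainty."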

In statistics we always construct some probability scheme which we
believe to be adequate to describe certain real phenomena. For
instance, we describe the situation concerning the possible outcomes
in tossing a coin by saying that the probability of obtaining a head
in one toss is 1/2, for in a long sequence of trials we would expect
to have about half as many heads as total tosses. Or, if we measure
the length of a bar by some instrument, we sometimes assume that the
result is a normally distributed random variable. The notions of a
random variable and a distribution function are defined as follows:
if $F(x)$ is a function expressing the probability that a real
variable $X < x$, we say that $X$ is a random variable and that $F(x)$
is the probability distribution of $X$. Then, if $F(x)$ is given by
the formula

(1)    $F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}} \, dy$

we say that $X$ is normally distributed. The quantities $\sigma$ and
$\mu$ are real parameters. Thus, if in measuring the length of a bar
by some instrument we assume that the outcome of the measurement is a
normally distributed random variable, we may express the probability
that a measurement will be less than a given value $x$ by (1).
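
Formula (1) can be evaluated with the error function from the standard library. This is a minimal sketch; the function name `normal_cdf` and the numerical values for the bar are hypothetical and serve only as an illustration.

```python
import math


def normal_cdf(x, mu, sigma):
    """Value of formula (1): the probability that a normally distributed
    variable with mean mu and standard deviation sigma is less than x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Hypothetical bar: mean length 100.0, instrument standard deviation 0.2.
print(normal_cdf(100.0, 100.0, 0.2))  # probability of a reading below 100.0
print(normal_cdf(100.3, 100.0, 0.2))  # probability of a reading below 100.3
```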

If $X_1, X_2, X_3, \ldots, X_n$ represent n random variables and
$x_1, x_2, \ldots, x_n$ any set of real numbers, we use the symbol
$F(x_1, x_2, \ldots, x_n)$ to express the probability of the composite
event that $X_1 < x_1$, $X_2 < x_2$, ..., $X_n < x_n$ simultaneously.
This function will be called the joint probability distribution of the
n random variables. We shall say that n random variables are
independently distributed if the function $F(x_1, x_2, \ldots, x_n)$
is the product of n functions such that only $x_1$ is involved in the
first, only $x_2$ in the second, and so on. That is,

       $F(x_1, x_2, \ldots, x_n) = F_1(x_1) \, F_2(x_2) \cdots F_n(x_n)$.
For example, if n measurements $X_1, X_2, \ldots, X_n$ of a bar are
independently and normally distributed with the same normal
distribution, we would obtain

(2)    $F(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\sigma^n} \int_{-\infty}^{x_1} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \, dy \int_{-\infty}^{x_2} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \, dy \cdots \int_{-\infty}^{x_n} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \, dy$

If we measure the length of a bar n times by some instrument, we
sometimes find it appropriate to adopt the probability scheme that the
results of the n measurements have a joint probability distribution
given by (2).
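
Because the measurements are assumed independent, the joint distribution (2) is simply a product of one-dimensional normal distribution functions. The sketch below illustrates this; the function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """One-dimensional normal distribution function, as in formula (1)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def joint_normal_cdf(xs, mu, sigma):
    """Formula (2): probability that X_1 < xs[0], ..., X_n < xs[n-1]
    simultaneously, for independent measurements with a common normal
    distribution."""
    prob = 1.0
    for x in xs:
        prob *= normal_cdf(x, mu, sigma)
    return prob


# Three independent readings, each below its mean with probability 1/2.
print(joint_normal_cdf([0.0, 0.0, 0.0], mu=0.0, sigma=1.0))
```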

One of the fundamental problems of statistical inference is that of
testing statistical hypotheses. The most general form of a statistical
hypothesis we have to deal with in statistical theory may be expressed
as follows. Let $X_1, \ldots, X_n$ be a finite set of random variables
and let $F(x_1, \ldots, x_n)$ be its joint probability distribution
function. Then the statistical hypothesis is the statement that the
unknown distribution function $F(x_1, \ldots, x_n)$ is an element of a
certain class $\omega$ of distribution functions. For instance, if
$X_1, \ldots, X_n$ are successive measurements on the length of a bar,
we may consider the hypothesis that $X_1, \ldots, X_n$ are
independently distributed with the same normal distribution. In this
case $\omega$ is a two parameter family given by (2), $\sigma$ being
any positive number and $\mu$ any real number.
If we consider the hypothesis that $X_1, \ldots, X_n$ are normally,
independently distributed with zero means ($\mu = 0$) and unit
variances ($\sigma^2 = 1$), then $\omega$ consists of a single
element. When the class $\omega$ consists of a single element, we
shall say that the hypothesis we are considering is a simple
hypothesis. Otherwise, it will be called composite.

The question of testing a given hypothesis may be formulated in the
following manner. We should like to know, on the basis of n
observations $x_1, \ldots, x_n$, where $x_\alpha$ is the observed
value of the random variable $X_\alpha$ ($\alpha = 1, \ldots, n$),
whether to accept or reject the hypothesis that the unknown
distribution function $F(x_1, \ldots, x_n)$ belongs to the class
$\omega$. The set of n observations can be represented by a point E of
n-dimensional Cartesian space, called the sample space. To test the
hypothesis on the basis of n observations we must choose a subset R of
the sample space and then reject the hypothesis if the sample point E
falls within R. Otherwise, we maintain the hypothesis. It is evident
that the fundamental problem here is the choice of the subset R, which
we shall call the critical region. The solution of this problem
depends, to some extent, upon any a priori knowledge we may have about
the unknown distribution function $F(x_1, \ldots, x_n)$. One of the
most important and most frequent a priori assumptions is that the
random variables $X_1, \ldots, X_n$ are independently distributed,
each having the same distribution. Thus, we have the assumption that
$F$ is of the form

       $F(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_i(x_i)$, where $F_i = F_j$ for all $i, j$.

Such a priori knowledge about our unknown distribution function can
always be expressed by saying that the function $F(x_1, \ldots, x_n)$
is an element of a certain class $\Omega$ of distribution functions.
The class $\omega$ which is being considered is then always a subclass
of $\Omega$. We shall see that the choice of the critical region R for
testing the hypothesis will depend upon the a priori knowledge
$\Omega$.

It is now seen that the problem of testing hypotheses can be
formulated as follows: Taking for granted that the unknown
distribution function $F$ is an element of a class $\Omega$, we wish
to test the hypothesis that $F$ belongs to a certain subclass $\omega$
of $\Omega$. The problem to be solved is the question of how the
critical region in the sample space should be chosen.

For instance, $\Omega$ may be defined by the statement that
$X_1, \ldots, X_n$ are independently and normally distributed each of
them having the same distribution, and $\omega$ may be the subclass of
$\Omega$ defined by the additional restriction that the mean values
[...] according to certain [...] adequate critical region [...], where
$\bar{x}$ [...] $\sum_{\alpha=1}^{n} (x_\alpha - \bar{x})^2$ [...] and
$c$ is a certain constant. If, however, $\Omega$ is a much broader
class defined by the statement that $X_1, \ldots, X_n$ are
independently distributed each having the same distribution, the above
critical region for testing $\omega$ is not adequate, and some other
critical region has to be chosen.
Before we proceed farther it might be well for us to list a few of the
mathematical terms used together with their meanings in statistics. We
can do this in tabular form.

MATHEMATICAL TERMINOLOGY              STATISTICAL INTERPRETATION

n-space $E_n$ (sample space)          Possible outcomes of n
                                      observations.

$\Omega$, class of functions          Class of possible probability
on $E_n$                              distributions.

$\omega$, subclass of $\Omega$        The statistical hypothesis: the
                                      true distribution is a member of
                                      $\omega$.

R (critical region), a subset         Criterion for rejecting the
of $E_n$                              hypothesis that the true
                                      distribution is a member of
                                      $\omega$.

Association of R with $\Omega$        Choice of the critical region
and $\omega$                          for testing the hypothesis.

The problem of testing hypotheses is only one of the problems of
statistical inference. Another is the problem of estimation. Given
that the unknown distribution function $F$ belongs to a certain class
$\Omega$ of distribution functions, how can we choose a function
$\varphi(E)$, defined for all points E of $E_n$, such that the value
of $\varphi(E)$ is always an element of $\Omega$ and can be considered
a "good" estimate of the unknown distribution function $F$? We may say
that $\varphi(E)$ is a "good statistical estimate" of $F$ if the
probability is as large as possible that $\varphi(E)$ is in a small
neighborhood of $F$. We will formulate this principle more precisely
in chapter III.

If, for instance, $\Omega$ is given by the statement that
$X_1, \ldots, X_n$ are independently and normally distributed with the
same means and unit variances, then $\Omega$ is a one parameter family
of distribution functions and an element of $\Omega$ is completely
specified by specifying the value of the unknown mean $\mu$. Hence, to
estimate the unknown distribution function $F$ is the same as to
estimate the unknown mean $\mu$. In this case the problem of
estimation is the problem of finding a real function $\varphi(E)$
defined for all points E of the sample space such that $\varphi(E)$
can be considered as a statistical estimate of the unknown mean $\mu$.
The classical solution of this problem in this particular case is
given by

       $\varphi(E) = \frac{x_1 + \cdots + x_n}{n}$.

The two types of problems of statistical inference mentioned so far do
not cover all possible problems. 3) The following problem, for
example, is neither a problem of testing a hypothesis nor one of
estimation: Consider three subclasses $\omega_1, \omega_2, \omega_3$
of the class $\Omega$ of distribution functions, and denote by $H_i$
the hypothesis that the unknown distribution $F$ is an element of
$\omega_i$. The problem considered is to decide on the basis of the n
observations which of the three hypotheses should be accepted (assume
that the sum of the three subclasses $\omega_1, \omega_2, \omega_3$ is
equal to $\Omega$). Such a situation may arise, for instance, in the
case of a manufacturer who has to keep the quality of his product
between two limits, and wants to test, by sampling, whether the
quality is actually between these limits, below the lower limit, or
above the upper limit. (Assume that the quality is measurable and can
be represented by a real number.)


3) See in this connection 16, pp. 299-300
The reasons why such a "trilemma" is a problem different from testing
a hypothesis or estimation can only be indicated here. It will be seen
that there are many approaches to each problem of inference, and that
the theory provides means of choosing among them by deciding that
certain approaches are "better" than certain others. Now, one might
suggest the reduction of the above "trilemma" to a problem of, say,
estimation by estimating the unknown distribution function $F$ and
accepting that hypothesis which corresponds to the subclass in which
the estimate of $F$ is contained. This would be one answer to the
trilemma, but by no means the "best" answer according to the standards
developed.
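
The reduction to estimation described above can be sketched as follows. The sketch assumes, purely for illustration, that the quality is estimated by the arithmetic mean of the observations; the function name `decide_quality`, the labels and the numerical limits are hypothetical. As the text stresses, this is one possible answer to the trilemma and not necessarily the best one.

```python
def decide_quality(observations, lower, upper):
    """Accept one of three hypotheses by locating an estimate of the
    unknown quality relative to the two limits."""
    estimate = sum(observations) / len(observations)  # assumed estimate: sample mean
    if estimate < lower:
        return "below lower limit"
    if estimate > upper:
        return "above upper limit"
    return "between limits"


# Hypothetical sample and limits.
print(decide_quality([9.8, 10.1, 10.0], lower=9.5, upper=10.5))
```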

The most general formulation of the problem of statistical inference
is this: Let $S$ be a system of subclasses of the class $\Omega$ of
distribution functions. For each element $s$ of $S$ consider the
hypothesis $H_s$ which states that the unknown distribution $F$ is an
element of $s$; denote by $H_S$ the system of all such hypotheses; the
problem is to decide, by means of a sample, which element of $H_S$
should be accepted.

The problems enumerated before are special cases of this general
problem. If $S$ consists of two elements only, one being a subclass
$\omega$ of $\Omega$ and the other its complement in $\Omega$, the
problem is the same as that of testing the hypothesis that the true
distribution function $F$ is an element of $\omega$. If $S$ is the
system of all elements of $\Omega$, we have the problem of estimation.
If $S$ consists of three classes, we have the trilemma.




II THE NEYMAN-PEARSON THEORY OF TESTING
A STATISTICAL HYPOTHESIS

The principles of statistical inference as developed in the last two
decades by R. A. Fisher, Neyman and Pearson deal with the problem of
testing a hypothesis and with the problem of estimation but not with
the general problem of statistical inference as it has been formulated
in the foregoing pages. A further restriction in these theories is
that they deal only with the case that $\Omega$ is a k-parameter
family of distribution functions, i.e., that the true but unknown
distribution function $F$ is known to be an element of a k-parameter
family of functions

       $F(x_1, \ldots, x_n, \theta_1, \theta_2, \ldots, \theta_k)$

where $\theta_1, \ldots, \theta_k$ are parameters. In this case the
specification of the values of the parameters specifies completely the
distribution function $F$.

A set of parameter values can be represented by a point in a
k-dimensional Euclidean space called a parameter space. Because of the
one-to-one correspondence between elements of $\Omega$ and points of
the parameter space we can identify $\Omega$ with the parameter space.
If, for example, $X_1, \ldots, X_n$ are normally and independently
distributed, each having the same distribution (equation (2)), then
the parameter space is a half plane where $\theta_1 = \mu$ = mean
value, and $0 < \theta_2 = \sigma$ = standard deviation.

A hypothesis concerning $F$ is expressed by the statement that the
true parameter point lies in a certain subset $\omega$ of the
parameter space $\Omega$. As we have done before, we shall call the
hypothesis a simple one if $\omega$ consists of a single point;

4) See, in this connection, references 12,13 and 14 
otherwise, it is called a composite hypothesis. In the above example
the statement that $\mu = 0$, $\sigma = 1$ is a simple hypothesis,
while merely stating that $\mu = 0$ without specifying $\sigma$ is a
composite hypothesis.

For the sake of simplicity we shall confine ourselves to the case of a
single unknown parameter since this suffices to illustrate the basic
ideas of the theories of Fisher, Neyman and Pearson. First, we shall
deal with the Neyman-Pearson theory of testing a statistical
hypothesis.

We assume that the unknown distribution function is known to be an
element of a one-parameter family $F(x_1, x_2, \ldots, x_n, \theta)$
and we wish to test the hypothesis $\theta = \theta_0$.

A simple example for this case is the following: Let it be known that
$X_1, \ldots, X_n$ are independently and normally distributed with the
same mean and unit variances, i.e., $\Omega$ is the one-parameter
family of distributions

       $F(x_1, \ldots, x_n, \theta) = \frac{1}{(2\pi)^{n/2}} \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} e^{-\frac{1}{2}\sum_{\alpha=1}^{n} (v_\alpha - \theta)^2} \, dv_1 \cdots dv_n$

and assume that we wish to test the hypothesis that $\theta = 0$.
According to the classical theory we reject this hypothesis if and
only if

       $|\bar{x}| > c$, where $\bar{x} = \frac{x_1 + \cdots + x_n}{n}$,

and $c$ denotes a certain constant. The value of $c$ is chosen in such
a way that the probability of $|\bar{x}| > c$, under the assumption
that the hypothesis $\theta = 0$ is true, is so small that we are
willing to reject the hypothesis. If we want this probability to be
5 percent, then $c = \frac{1.96}{\sqrt{n}}$.
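
The constant c and the resulting decision rule can be computed as in the sketch below. It relies on the fact that, when the mean is zero and the variances are one, the arithmetic mean of n observations is normally distributed with standard deviation 1/sqrt(n). The function names and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def critical_value(n, size=0.05):
    """Constant c such that |mean| > c has probability `size` when the
    true mean is zero and the variances are one."""
    z = NormalDist().inv_cdf(1.0 - size / 2.0)  # about 1.96 for size 0.05
    return z / math.sqrt(n)


def reject_zero_mean(observations, size=0.05):
    """Classical rule: reject the hypothesis of zero mean iff |mean| > c."""
    n = len(observations)
    mean = sum(observations) / n
    return abs(mean) > critical_value(n, size)


print(round(critical_value(4), 3))        # c for four observations
print(reject_zero_mean([1.5, 1.4]))       # hypothetical sample far from zero
print(reject_zero_mean([0.5, -0.2]))      # hypothetical sample near zero
```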
If, in the same example, we have made only two observations
$x_1, x_2$, so that the sample space is the Euclidean plane, the
critical region consists of all points for which
$\frac{1}{2}(x_1 + x_2) > \frac{1.96}{\sqrt{2}}$ and all points for
which $\frac{1}{2}(x_1 + x_2) < -\frac{1.96}{\sqrt{2}}$. If the point
representing the observations falls within the critical region (i.e.,
if the arithmetic mean of the two observations is larger than
$\frac{1.96}{\sqrt{2}}$ or smaller than $-\frac{1.96}{\sqrt{2}}$) we
shall reject the hypothesis that the mean value is zero.

But the classical theory does not suggest why this critical region
should be used. It merely proves that the probability for the
observation point to fall within the critical region is five percent
when the initial hypothesis is fulfilled. But there are infinitely
many regions which enjoy the same property, and the classical theory
does not give any reasons why just the one region mentioned should be
chosen.

In order to arrive at a distinction between various critical regions,
Neyman and Pearson advance the following considerations. In making a
statement of acceptance or rejection of a hypothesis, we may commit
two types of errors: rejecting the hypothesis although it is true
(error of type I), or failing to reject it although it is false (error
of type II). If the hypothesis consists in saying that the unknown
parameter $\theta$ has a given value $\theta_0$, the situation may be
summarized as follows:


Truth or Falsehood of Statement
Concerning the Hypothesis $\theta = \theta_0$

                               Statement Advanced
True Situation                 $\theta = \theta_0$    $\theta \ne \theta_0$

$\theta = \theta_0$            Correct                Type I error

$\theta \ne \theta_0$          Type II error          Correct

By the size of the critical region we mean the probability that the
point representing the observations will fall within the critical
region, where the probability in question is calculated under the
assumption that the hypothesis is true. (Thus, in the example used
before, the size of the critical region was five percent.) This may be
expressed by saying that the size of the critical region is equal to
the probability of committing a type I error.

The general idea underlying the theory of Neyman and Pearson is to
minimize the probability of type II errors while keeping the
probability of type I errors constant.

If R is any region in the sample space, and E is the point of the
sample space which represents the observations, we shall denote by
$P(R|\theta_1)$ the probability of E lying in R calculated under the
assumption that $\theta_1$ is the true value of the unknown parameter
$\theta$, that is to say, $P(R|\theta_1)$ is equal to the Stieltjes
integral $\int dF(x_1, \ldots, x_n, \theta_1)$ over the region R.
Thus, if we make the hypothesis $\theta = \theta_0$ and choose R as a
critical region for this hypothesis, the size of the critical region
will be given by the expression $P(R|\theta_0)$. If the hypothesis is
wrong and the true value of $\theta$ is $\theta_1$, then the
probability of avoiding an error of type II is $P(R|\theta_1)$.

The expression $P(R|\theta_1)$, i.e., one minus the probability of an
error of type II, is called the power of the critical region R with
respect to the alternative hypothesis $\theta = \theta_1$.

The expression $P(R|\theta)$ is a function of $\theta$. It may be
plotted as a curve, the ordinate of which is equal to the size of R if
the abscissa is $\theta_0$, and equal to the power of R with respect
to the alternative $\theta = \theta_1$ if the abscissa is any value
$\theta_1 \ne \theta_0$. This curve is called the power curve of the
region R.

In the former example, in which the distribution was normal with
unknown mean and unit variance, and the critical region chosen was
$|\bar{x}| > \frac{1.96}{\sqrt{n}}$ (where $\bar{x}$ is the arithmetic
mean of the observations $x_1, x_2, \ldots, x_n$), the power curve can
easily be calculated and has the form shown below:

[Figure: power curve of the region R]
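
The power curve of this region can be tabulated directly. Since the arithmetic mean of n observations is normal with mean theta and standard deviation 1/sqrt(n), the probability of falling in the region is 1 - Phi(1.96 - sqrt(n) theta) + Phi(-1.96 - sqrt(n) theta), where Phi is the standard normal distribution function. The sketch below evaluates this expression; the function name and the choice n = 4 are assumptions made for illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def power_two_sided(theta, n, z=1.96):
    """Probability that |mean| > z / sqrt(n) when the true mean is theta
    and the observations have unit variance."""
    shift = math.sqrt(n) * theta
    return (1.0 - PHI(z - shift)) + PHI(-z - shift)


# Tabulate the power curve for n = 4 observations.
for theta in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(theta, round(power_two_sided(theta, n=4), 3))
```

The curve is symmetric about zero, takes the value 0.05 (the size of the region) at zero, and rises toward 1 as the true mean moves away from zero in either direction.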
In order to compare the test $|\bar{x}| > \frac{1.96}{\sqrt{n}}$ with
other possible tests, we have to compare the above power curve with
the power curves of other critical regions which have the same size,
five percent.

In general, if we have two critical regions R and R', both of which
have the desired size, and if the power curve of R' is above that of R
for the value $\theta = \theta_1$, then the critical region R' is
better than R for testing the hypothesis if the true value of $\theta$
happens to be $\theta_1$. For the probability of committing a type I
error is the same whether R or R' is used, while the probability of
committing a type II error when using R' is smaller than when using R.
If the power curve of R' is above that of R for each $\theta$ (except
$\theta_0$, for which the two curves coincide by assumption), then R'
will be called uniformly more powerful than R. The test using the
critical region R is called non-admissible because its use is, under
all circumstances, less favorable than the use of R'.

In order to make this clear, let us assume that a large number of
samples is drawn, each of which consists of N individual observations.
Let M be the number of such samples and let two statisticians, whom we
will call S and S', test the same hypothesis, using each of the M
samples. Assume that S uses the critical region R for testing while S'
bases his tests on the region R'. S and S' will each obtain M answers
to the question as to whether the null hypothesis (the hypothesis to
be tested) should be rejected. Some of these answers will be right,
others will be wrong. Let us compare the records of S and S'. We have
to distinguish between the case that the null hypothesis is true and
the case that it is false.

a) In the first case, the answers obtained by each statistician may
either be that the hypothesis is to be accepted (these answers are
right) or that it should be rejected (these answers are errors of
type I). The probability of committing a type I error by testing the
null hypothesis from a sample drawn at random is equal to the size of
the critical region used in testing. If M is large, it is practically
certain that the relative frequency of type I errors will be
approximately equal to their probability, i.e., to the size of the
critical region. Since R and R' have, by assumption, equal size, each
of the two statisticians will commit approximately the same number of
errors.

b) If the null hypothesis is false, some of the M answers obtained by
each statistician will correctly reject it, while others will accept
it, thus committing errors of type II. If M is large, the relative
frequency of correct answers will be approximately equal to the power
of the test used, which, as we have pointed out, is the probability of
avoiding a type II error. By assumption, the power of R' is greater
than that of R, regardless of what the true value of $\theta$ is,
provided only that $\theta$ is different from $\theta_0$. Therefore,
the relative frequency of wrong answers obtained by S will tend to be
greater than the relative frequency of wrong answers obtained by S'.

Thus, if the null hypothesis is false (no matter what the true value
of $\theta$ is), it is practically certain that S will make more false
statements; while if the null hypothesis is true, S and S' will commit
an approximately equal number of false statements. The method used by
S', i.e., the application of the critical region R', is therefore
superior to the method used by S, i.e., the application of the
critical region R.
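
The comparison of the two records can be imitated by simulation. The sketch below rests on assumptions that are not part of the argument above: the observations are normal with unit variance, the hypothesis is that the mean is zero, S uses the two-sided region |mean| > 1.96/sqrt(n), and S' uses the one-sided region mean > 1.645/sqrt(n). The one-sided region is more powerful than the two-sided one only for positive true means, so the illustration is restricted to that case. The function name, the seeds and the sample counts are hypothetical.

```python
import math
import random


def count_rejections(theta, n, num_samples, seed):
    """Draw num_samples samples of n normal observations with mean theta
    and unit variance; return how often S and S' reject the hypothesis
    that the mean is zero."""
    rng = random.Random(seed)
    c_two = 1.96 / math.sqrt(n)    # region used by S (two-sided)
    c_one = 1.645 / math.sqrt(n)   # region used by S' (one-sided)
    rejected_by_s = 0
    rejected_by_s_prime = 0
    for _ in range(num_samples):
        mean = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if abs(mean) > c_two:
            rejected_by_s += 1
        if mean > c_one:
            rejected_by_s_prime += 1
    return rejected_by_s, rejected_by_s_prime


# Null hypothesis true: both commit type I errors in roughly 5% of samples.
print(count_rejections(theta=0.0, n=4, num_samples=2000, seed=2))
# Null hypothesis false (true mean 0.5): S' rejects correctly more often.
print(count_rejections(theta=0.5, n=4, num_samples=2000, seed=1))
```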
These considerations decide the choice between two critical regions of
equal size if one of them is uniformly more powerful than the other,
i.e., if the power curve of the former is above that of the latter for
all values of $\theta$ except $\theta_0$ (for which the power curves
coincide). On the other hand, if the power curve of R' is above that
of R for some values of $\theta$, but below it for other values of
$\theta$, then we cannot choose one of the two regions without
introducing further principles on which to base the choice.

If, for all values of $\theta$, the power curve of a region R is never
below that of any other region R' of equal size, then R is called a
uniformly most powerful region, and the test corresponding to R a
uniformly most powerful test.

The first principle for selecting a test is this: whenever we can find
a uniformly most powerful test, we shall prefer it to all other tests
using regions of the same size. Unfortunately, uniformly most powerful
tests do not exist in most cases.

In the example which we have used above let us consider the region R'
determined by the inequality $\bar{x} > \frac{1.645}{\sqrt{n}}$. It
can easily be shown that R' (like the region R considered before) has
the size .05. The power curves of R and R' are shown below:

[Figure: power curves of the regions R and R']
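
The two power curves can be compared numerically. This sketch assumes the one-sided constant 1.645 (the value for which the one-sided region has size .05) and a sample of n = 4 observations; the function names are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def power_two_sided(theta, n, z=1.96):
    """Power of R: |mean| > z / sqrt(n), true mean theta, unit variance."""
    shift = math.sqrt(n) * theta
    return (1.0 - PHI(z - shift)) + PHI(-z - shift)


def power_one_sided(theta, n, z=1.645):
    """Power of R': mean > z / sqrt(n), true mean theta, unit variance."""
    return 1.0 - PHI(z - math.sqrt(n) * theta)


print("theta   R       R'")
for theta in (-1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0):
    print(theta,
          round(power_two_sided(theta, 4), 3),
          round(power_one_sided(theta, 4), 3))
```

Both curves pass through .05 at zero; the curve of R' lies above that of R for positive values of the parameter and below it for negative values.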
We can see that for all $\theta > 0$, R' is more powerful than R, and
vice versa for $\theta < 0$. In such cases further principles have to
be formulated on which the choice should be based. It is clear that
the choice we make will depend on our a priori degree of belief in the
truth of the different possible values of $\theta$. For instance, if
we know a priori that $\theta$ cannot be negative, then we shall
prefer R'.

Moreover, it can be shown that R' is uniformly most powerful if the
parameter space is restricted to non-negative values of $\theta$. If
negative and positive values of $\theta$ are considered a priori as
equally possible we will most likely prefer R to R'.

This example shows also that the choice of the critical region depends
essentially on $\Omega$. If $\Omega$ consists of all non-negative
values of $\theta$ then the region R' is a uniformly most powerful
test. If $\Omega$ consists of all non-positive values of $\theta$,
then the region R'' given by $\bar{x} < -\frac{1.645}{\sqrt{n}}$ is a
uniformly best region. Finally, if $\Omega$ consists of all real
values of $\theta$, then the use of the region R seems to be more
reasonable than that of R' or R''.

Since uniformly most powerful regions rarely exist, Neyman and Pearson
introduced a further principle on which the choice of the critical
region should be based, namely, the principle of unbiasedness. A test
is called unbiased if the power function of the test has a relative
minimum at the value $\theta = \theta_0$, where $\theta = \theta_0$ is
the hypothesis to be tested.

Some rationalization of this principle can be given: Suppose a test is
biased; then for some value $\theta_1$, in the neighborhood of
$\theta_0$, the power of the test is less than the size of the region.
But this means that the probability of rejecting the hypothesis
$\theta = \theta_0$ is larger if $\theta_0$ is true than if $\theta_1$
is true, which is not a desirable situation.
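
Bias can be seen numerically in the normal example with unit variance and hypothesis that the mean is zero. The sketch below assumes the two-sided region |mean| > 1.96/sqrt(n) and the one-sided region mean > 1.645/sqrt(n), both of size .05, with n = 4; the function names and the alternative value -0.1 are hypothetical. When all real values of the parameter are admitted, the one-sided region is biased: for a slightly negative true mean its power drops below its size.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def power_two_sided(theta, n, z=1.96):
    """Power of the region |mean| > z / sqrt(n)."""
    shift = math.sqrt(n) * theta
    return (1.0 - PHI(z - shift)) + PHI(-z - shift)


def power_one_sided(theta, n, z=1.645):
    """Power of the region mean > z / sqrt(n)."""
    return 1.0 - PHI(z - math.sqrt(n) * theta)


n = 4
theta_1 = -0.1  # an alternative close to the hypothetical value 0
print(round(power_two_sided(0.0, n), 4), round(power_two_sided(theta_1, n), 4))
print(round(power_one_sided(0.0, n), 4), round(power_one_sided(theta_1, n), 4))
```

In the first printed line the power at the alternative exceeds the size, as unbiasedness requires; in the second it falls below the size.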

In general, an infinity of unbiased tests exist, hence we need a
further principle in order to select a proper test from among them. We
define as a uniformly most powerful unbiased test one which is at
least as powerful, with respect to all alternative hypotheses, as any
other unbiased test of equal size. If a uniformly most powerful
unbiased test exists, and if we accept the principle of unbiasedness,
then it is obvious that it is the most advantageous test to use.
Neyman and Pearson called a critical region corresponding to a
uniformly most powerful unbiased test a critical region of type $A_1$.

Referring to the example previously considered, the critical region
given by $|\bar{x}| > c$ is a region of type $A_1$ for testing the
hypothesis in question. Another example of a region of type $A_1$ is
the following: Let $X_1, \ldots, X_n$ be independently and normally
distributed with zero means and a common variance. Then, for testing
the hypothesis that the common variance $\sigma^2$ is equal to
$\sigma_0^2$, the critical region consisting of all points of the
sample space which satisfy at least one of the inequalities

       $x_1^2 + \cdots + x_n^2 > c_1$  or  $x_1^2 + \cdots + x_n^2 < c_2$,

is a critical region of type $A_1$ if the constants $c_1$ and $c_2$
are properly chosen.

The region of type $A_1$ exists in an important, but very restricted,
class of cases; there are many instances in which it does not exist.
Therefore, Neyman and Pearson have introduced a third type of region,
known as a region of type A. The region R is said to be of type A if
its power function $P(R|\theta)$ is such that

1)     $\left. \frac{\partial P(R|\theta)}{\partial \theta} \right|_{\theta = \theta_0} = 0$,

and

2)     $\left. \frac{\partial^2 P(R|\theta)}{\partial \theta^2} \right|_{\theta = \theta_0} \ge \left. \frac{\partial^2 P(R'|\theta)}{\partial \theta^2} \right|_{\theta = \theta_0}$

for all regions R' which satisfy 1) and have the same size as R. The
first condition restricts the region to be unbiased. The second
requires the power function of a region of type A to have a greater
curvature than that of any other unbiased region of the same size. To
put it crudely, it means that the region is most powerful in the
neighborhood of $\theta_0$.

A critical region of type A exists under very weak conditions which
are fulfilled in most of the practical cases. However, the objection
can be raised against a region of type A that we are much more
concerned with the behavior of the power function for alternatives
$\theta$ which are far from $\theta_0$ than for those in the
neighborhood of $\theta_0$. In spite of this, as we will see, a good
justification of the use of a type A region can be given in the light
of some recent results.



III R. A. FISHER'S THEORY OF ESTIMATION 5)

The problem of estimation of the unknown parameter $\theta$ is the
problem of finding a function $t(x_1, \ldots, x_n)$ of the
observations such that $t$ can be considered in a certain sense as a
"good" or "best" estimate of $\theta$. Since the estimate
$t(x_1, \ldots, x_n)$ is a random variable, we cannot expect that its
value should coincide with that of the unknown parameter, but we will
try to choose $t(x_1, \ldots, x_n)$ in such a way as to make as great
as possible the probability of the value of $t$ lying as near as
possible to the value of the unknown parameter $\theta$.

This is a somewhat vague formulation of the requirement for a "good"
or "best" statistical estimate. It can be made precise in different
ways. Markoff, 6) for instance, defines the notion of a "best"
estimate as follows: A statistic $t$ (we shall call any function of
the observations a statistic) is a best estimate of $\theta$ if

(1) $t$ is an unbiased estimate of $\theta$, i.e.,
    $E_\theta(t) = \theta$ identically in $\theta$, where
    $E_\theta(t)$ denotes the expected value of $t$ under the
    assumption that $\theta$ is the true value of the parameter.

(2) $E_\theta(t - \theta)^2 \le E_\theta(t' - \theta)^2$ identically
    in $\theta$ for all $t'$ which satisfy (1).

This definition of a "best estimate" seems to be a reasonable and
acceptable one since, in general, the smaller the variance of $t$ the
greater is the probability that $t$ will lie in a small

5) See references 3-0 

6) See reference p. 344
neighborhood of $\theta$. It should be remarked that although (by
virtue of Tshebysheff's inequality) smallness of the variance
implies that the probability of t lying in a small neighborhood
of $\theta$ is high, the converse is not necessarily true. It
may happen that a statistic t has a large variance and, nevertheless,
the probability of t lying in a small neighborhood of
$\theta$ is high. This circumstance constitutes some argument against
Markoff's definition. A more serious difficulty is, however,
the fact that a best estimate in Markoff's sense seldom exists.
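
The one-sided nature of this implication can be written out. The following is a sketch in modern notation (the symbol $\varepsilon$ is introduced here for illustration and does not appear in the text): for any statistic t with finite $E_\theta(t-\theta)^2$ and any $\varepsilon > 0$, Tshebysheff's inequality gives

```latex
P\bigl(\,|t - \theta| < \varepsilon \;\big|\; \theta\bigr)
  \;\ge\; 1 - \frac{E_\theta (t - \theta)^2}{\varepsilon^2}.
```

For an unbiased t the numerator is the variance, so a small variance forces a high probability of t falling within $\varepsilon$ of $\theta$. The inequality gives no bound in the other direction: rare but very large deviations can make the variance arbitrarily large while the probability on the left stays close to one.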
R. A. Fisher's theory of estimation is based on the principle
of the maximum likelihood. It is assumed that a probability
density

   $p(x_1, \ldots, x_n, \theta)$

exists in the sample space, i.e., for any measurable subset W of
the sample space

   $$P(W \mid \theta) = \int_W p(x_1, \ldots, x_n, \theta)\, dx_1 \cdots dx_n.$$

In particular, the cumulative distribution function is
given by

   $$F(x_1, \ldots, x_n, \theta) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} p(v_1, \ldots, v_n, \theta)\, dv_1 \cdots dv_n.$$

The maximum likelihood estimate $\hat{\theta}_n(x_1, \ldots, x_n)$ is defined as
that value of $\theta$ for which $p(x_1, \ldots, x_n, \theta)$ becomes a maximum.

Now assume that $x_1, \ldots, x_n$ are n independently distributed
random variables each having the same distribution. This can also
be expressed by saying that $x_1, \ldots, x_n$ are n independent
observations on the same random variable X. The main result of
Fisher's theory of estimation can be stated as follows: If
$x_1, \ldots, x_n$ are n independent observations (n = 1, 2, ..., ad inf.)
on the same random variable X and if the distribution of X
satisfies certain conditions (which are not too restrictive and
in practical application are frequently fulfilled), then
$\hat{\theta}_n$ is an efficient estimate. The definition of an
efficient estimate is given as follows:

A sequence $\{t_n\}$ (n = 1, 2, ..., ad inf.) of statistics is
called an efficient estimate of $\theta$ (the subscript n indicates
the number of observations of which $t_n$ is a function) if

(1) the limit distribution of $\sqrt{n}\,(t_n - \theta)$ is a
    normal distribution with zero mean and finite
    variance, and

(2) for any sequence $\{t'_n\}$ of statistics which satisfies (1)

       $$\sigma^2 / \sigma'^2 \le 1,$$

    where $\sigma^2 = \lim E[\sqrt{n}\,(t_n - \theta)]^2$

    and $\sigma'^2 = \lim E[\sqrt{n}\,(t'_n - \theta)]^2$.

The ratio $\sigma^2 / \sigma'^2$ is called the efficiency of $\{t'_n\}$,
which is always $\le 1$.

Vaguely speaking, in large samples the maximum likelihood
estimate has the smallest variance compared with any other
statistic which is in the limit normally distributed. The
restriction of the comparison to statistics which are in the
limit normally distributed seems to be a serious one. However,
as recent results show, the maximum likelihood estimate has a
much stronger property than efficiency, and it can be considered
as a "best" large sample estimate of $\theta$ compared even
with statistics which are not normally distributed in the limit. 7)


7) See reference 20 
The question of consistency and limit distribution of the
maximum likelihood estimate has been treated by H. Hotelling,?.
A complete proof has been given by J. L. Doob, 1.

As an example, let $x_1, \ldots, x_n$ be n independent observations
on a normally distributed variate X with unknown mean $\theta$ and
unit variance. It can easily be verified that the maximum
likelihood estimate of $\theta$ is given by

   $$\hat{\theta}_n(x_1, \ldots, x_n) = \frac{x_1 + \cdots + x_n}{n}.$$

Let $t_n(x_1, \ldots, x_n)$ be the median of the observations
$x_1, \ldots, x_n$. It can be shown that the limit distribution of
$\sqrt{n}\,(t_n - \theta)$ is normal with zero mean and variance
$\pi/2$. Hence, the efficiency of the median for estimating
$\theta$ is equal to $2/\pi = 0.6366\ldots$
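
The efficiency of the median can be checked numerically. The sketch below is only an illustration under assumed settings (the sample size 101, the number of repetitions, the seed and all function names are choices made here, not taken from the text): it draws repeated normal samples with unit variance and compares the mean squared deviation of the sample mean with that of the sample median.

```python
import math
import random
import statistics


def simulated_efficiency_of_median(n=101, repetitions=2000, theta=0.0, seed=1):
    """Estimate E(mean - theta)^2 / E(median - theta)^2 by simulation.

    The observations are assumed normal with mean theta and unit variance.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(repetitions):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))     # maximum likelihood estimate
        medians.append(statistics.median(sample))  # competing statistic t_n
    # pvariance(data, mu) is the mean squared deviation about the given mu.
    return statistics.pvariance(means, theta) / statistics.pvariance(medians, theta)


limiting_efficiency = 2.0 / math.pi  # sigma^2 / sigma'^2 = 1 / (pi / 2)
estimated_efficiency = simulated_efficiency_of_median()
print(f"limit: {limiting_efficiency:.4f}  simulated: {estimated_efficiency:.4f}")
```

For a finite sample the simulated ratio need not equal $2/\pi$ exactly; it should only lie in its neighborhood and approach it as n grows.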



IV THE THEORY OF CONFIDENCE INTERVALS


The procedure of estimation, as I formulated it here, is
also called estimation by a point. For practical applications
the estimation by intervals seems to be much more important.
That is to say, we have to construct two functions of the
observations $\underline{\theta}(E)$ and $\bar{\theta}(E)$, where E denotes
a point of the sample space, and we estimate the parameter to be
within the interval $\delta(E) = [\underline{\theta}(E), \bar{\theta}(E)]$.
In connection with the theory of interval estimation, R. A. Fisher
introduced the notion of fiducial probability and fiducial limits,
while Neyman 8) developed the theory of interval estimation based
on the classical theory of probability. I shall give here a brief
outline of Neyman's theory.

Before the sample has been drawn the point E is a random
variable and, therefore, the values of $\underline{\theta}(E)$ and
$\bar{\theta}(E)$ are also random variables. Hence, before the
sample has been drawn we can speak of the probability that

(3)   $\underline{\theta}(E) \le \theta \le \bar{\theta}(E)$

even if $\theta$ is considered merely as an unknown constant. After
the sample has been drawn and we have obtained a particular
sample point, say $E_0$, it does not make sense to speak of the
probability that

(4)   $\underline{\theta}(E_0) \le \theta \le \bar{\theta}(E_0)$,

if $\theta$ is merely an unknown constant. Each term in the
inequality (4) is a fixed constant, and the inequality (4) is either


8) See reference 
right or wrong for those particular constants. It would be
proper to talk about the probability of (4) if $\theta$ itself could
be considered as a random variable having a certain probability
distribution, called an a priori probability distribution. In
this case we understand by the probability that (4) holds the
conditional probability, called also a posteriori probability,
under the assumption that $E = E_0$ occurred. If an a priori
distribution of $\theta$ exists and if it is known then, using Bayes'
formula, we can easily calculate the a posteriori probability
distribution of $\theta$. However, in practical applications we seldom
meet cases where the assumption of the existence of an a priori
probability distribution seems to be justified; and even in
those rare cases in which the latter assumption can be made, we
usually do not know the shape of the a priori probability
distribution and this makes the application of Bayes' theorem
impossible. For these reasons the theory of interval estimation
has to be developed in such a way that its validity should not
depend on the existence of an a priori probability distribution.
Hence, in this theory we shall speak only of the probability of
(3) but never of the probability of (4).
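
The dependence of the probability of (4) on the a priori distribution can be seen in a small computation. The sketch below rests on assumptions made only for illustration: the observations are normal with unit variance, the a priori distribution of $\theta$ is normal, and the sample values, prior parameters and function names are invented here. Under these assumptions the a posteriori distribution is again normal, and the probability of a fixed interval follows from the normal distribution function.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_probability_of_interval(sample, lower, upper, prior_mean, prior_variance):
    """A posteriori probability that lower <= theta <= upper.

    Assumes observations N(theta, 1) and an a priori distribution
    N(prior_mean, prior_variance) of theta (illustrative assumptions).
    """
    n = len(sample)
    precision = 1.0 / prior_variance + n  # a posteriori precision
    post_mean = (prior_mean / prior_variance + sum(sample)) / precision
    post_sd = math.sqrt(1.0 / precision)
    return (normal_cdf((upper - post_mean) / post_sd)
            - normal_cdf((lower - post_mean) / post_sd))


sample = [0.3, -0.1, 0.8, 0.4]  # hypothetical observed sample point E_0
mean = sum(sample) / len(sample)
half_width = 1.96 / math.sqrt(len(sample))
interval = (mean - half_width, mean + half_width)

# The same fixed interval under two different a priori distributions.
vague = posterior_probability_of_interval(sample, *interval, prior_mean=0.0, prior_variance=1e9)
sharp = posterior_probability_of_interval(sample, *interval, prior_mean=5.0, prior_variance=0.01)
print(f"nearly flat prior: {vague:.4f}   prior concentrated near 5: {sharp:.6f}")
```

The same interval receives an a posteriori probability near 0.95 under one a priori distribution and near zero under the other, which is why a statement about (4) cannot be made without knowing that distribution.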

For any relationship R we will denote by $P[R \mid \theta]$ the
probability of R calculated under the assumption that $\theta$ is the
true value of the parameter.

A pair of functions $\underline{\theta}(E)$ and $\bar{\theta}(E)$ is called a
confidence interval of $\theta$ if

1) $\underline{\theta}(E) \le \bar{\theta}(E)$ for all points E

2) $P[\underline{\theta}(E) \le \theta \le \bar{\theta}(E) \mid \theta] = \alpha$ for all values of $\theta$,

where $\alpha$ is a fixed constant called the confidence coefficient.
The practical meaning and importance of the notion of the
confidence interval is this: If a large number of samples are
drawn and if in each case we make the statement that $\theta$ is
included in the interval $[\underline{\theta}(E), \bar{\theta}(E)]$, then the
relative frequency of correct statements will approximately be
equal to $\alpha$.
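
This frequency interpretation can be imitated by simulation. The sketch below is an illustration under assumptions chosen here (normal observations with unit variance, the interval sample mean $\pm\, 1.96/\sqrt{n}$ with confidence coefficient of about 0.95, and invented function names and settings); it counts how often the interval covers the true $\theta$ when many samples are drawn.

```python
import math
import random


def coverage_frequency(theta, n=25, samples=4000, z=1.96, seed=7):
    """Relative frequency with which [mean - z/sqrt(n), mean + z/sqrt(n)] covers theta.

    Assumes n independent N(theta, 1) observations per sample.
    """
    rng = random.Random(seed)
    half_width = z / math.sqrt(n)
    correct = 0
    for _ in range(samples):
        mean = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if mean - half_width <= theta <= mean + half_width:
            correct += 1
    return correct / samples


for theta in (-3.0, 0.0, 10.0):
    print(theta, coverage_frequency(theta))
```

The relative frequency stays near 0.95 whatever value of $\theta$ is used to generate the samples, in line with condition 2) of the definition.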

In general, there exist infinitely many confidence intervals
corresponding to a fixed confidence coefficient $\alpha$, and we
have to set up some principle for choosing from among them. It
is obvious that we want the confidence interval corresponding
to a fixed confidence coefficient to be as "short" as possible.
We have to give a precise definition of the notion "shortest"
confidence interval.

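A confidence interval $\delta(E) = [\underline{\theta}(E), \bar{\theta}(E)]$ is called a
shortest confidence interval corresponding to the confidence
coefficient $\alpha$ if

(a) $P[\underline{\theta}(E) \le \theta \le \bar{\theta}(E) \mid \theta] = \alpha$ and

(b) for any confidence interval $\delta'(E) = [\underline{\theta}'(E), \bar{\theta}'(E)]$
    which satisfies (a)

    $$P[\underline{\theta}(E) \le \theta' \le \bar{\theta}(E) \mid \theta''] \le P[\underline{\theta}'(E) \le \theta' \le \bar{\theta}'(E) \mid \theta'']$$

    for all values $\theta'$ and $\theta''$ of $\theta$.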

If a shortest confidence interval exists, it seems to be the 
most advantageous. Unfortunately, shortest confidence inter- 
vals exist only in quite exceptional cases. Therefore, we have 
to introduce some further principles on which the choice should 
be based. Such a principle is the principle of unbiasedness. 

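A confidence interval is called an unbiased confidence
interval corresponding to the confidence coefficient $\alpha$ if

   $P[\underline{\theta}(E) \le \theta \le \bar{\theta}(E) \mid \theta] = \alpha$

and $P[\underline{\theta}(E) \le \theta' \le \bar{\theta}(E) \mid \theta''] \le \alpha$ for all values $\theta'$ and $\theta''$.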
A confidence interval $\delta(E)$ is called a shortest unbiased
confidence interval corresponding to the confidence coefficient
$\alpha$ if $\delta(E)$ is an unbiased confidence interval with the
confidence coefficient $\alpha$ and if for any unbiased confidence
interval $\delta'(E)$ with the same confidence coefficient, we have

   $$P[\underline{\theta}(E) \le \theta' \le \bar{\theta}(E) \mid \theta''] \le P[\underline{\theta}'(E) \le \theta' \le \bar{\theta}'(E) \mid \theta'']$$

for all values $\theta'$ and $\theta''$.

If we accept the principle of unbiasedness, the shortest
unbiased confidence interval seems to be the most favorable one.
Even shortest unbiased confidence intervals exist only in a
restricted, but important, class of cases. If a shortest
unbiased confidence interval does not exist, Neyman proposes the
use of a third type of confidence interval, which he calls
short unbiased confidence interval. An unbiased confidence
interval $\delta(E)$ with the confidence coefficient $\alpha$ is called a
short unbiased confidence interval if

   $$\left.\frac{\partial^2}{\partial \theta''^2} P[\underline{\theta}(E) \le \theta' \le \bar{\theta}(E) \mid \theta'']\right|_{\theta''=\theta'} \le \left.\frac{\partial^2}{\partial \theta''^2} P[\underline{\theta}'(E) \le \theta' \le \bar{\theta}'(E) \mid \theta'']\right|_{\theta''=\theta'}$$

for all $\theta'$ and for all unbiased confidence intervals $\delta'(E)$ with
the confidence coefficient $\alpha$.

I have discussed only the case of a single unknown parameter.
In the case of several unknown parameters some new problems
arise, which do not occur in the case of a single parameter.
However, I shall not discuss them, since the case of a
single parameter already provides a good illustration of the
basic ideas of the theories of Fisher, Neyman and Pearson.



V ASYMPTOTICALLY MOST POWERFUL TESTS AND
ASYMPTOTICALLY SHORTEST CONFIDENCE INTERVALS 9)

As we have seen, if a uniformly most powerful (unbiased)
test and a shortest (unbiased) confidence interval exist, they
provide a satisfactory solution of the problem of testing a
hypothesis and the problem of interval estimation. Unfortunately,
they exist only in a restricted class of cases. As substitutes
for them the use of a critical region of type A and a
short confidence interval, respectively, have been proposed.

The appropriateness of the region of type A seems somewhat
doubtful, since we are more interested in the behavior of the
power function at values of $\theta$ far from the value $\theta_0$ to be
tested than at values of $\theta$ near to $\theta_0$. Similar objections
can be raised to the use of a short confidence interval. Recent
investigations show, however, that the situation is much more
favorable than appears at first glance. It is shown that the
difficulties arising because of the non-existence of uniformly
most powerful unbiased tests and shortest unbiased confidence
intervals gradually disappear with increasing size of the
sample, since so-called asymptotically most powerful unbiased
tests and asymptotically shortest unbiased confidence intervals
practically always exist.

We shall assume that the observations $x_1, \ldots, x_n$ are n
independent observations on the same random variable X whose
distribution function involves a single unknown parameter $\theta$. We
shall also assume that X has a probability density function,


9) See references 17-20 
say $f(x, \theta)$. Since in our discussions the number of
observations n will not be kept constant, we shall indicate the
dimension of the sample space by proper subscripts. For instance,
a critical region in the n-dimensional sample space will be
denoted by a capital letter with the subscript n. A point of
the n-dimensional sample space will be denoted by $E_n$ and a
confidence interval based on n observations by $\delta_n(E_n)$.

For any region $U_n$ denote by $G(U_n)$ the greatest lower
bound of $P(U_n \mid \theta)$ with respect to $\theta$. For any pair of
regions $U_n$ and $T_n$ denote by $L(U_n, T_n)$ the least upper bound
of $P(U_n \mid \theta) - P(T_n \mid \theta)$ with respect to $\theta$.

A sequence $\{W_n\}$ (n = 1, 2, ..., ad inf.) of regions is said to be
an asymptotically most powerful test of the hypothesis $\theta = \theta_0$
on the level of significance $\alpha$ if $P(W_n \mid \theta_0) = \alpha$ and if for any
sequence $\{Z_n\}$ of regions for which $P(Z_n \mid \theta_0) = \alpha$, the
inequality

   $$\limsup_{n \to \infty} L(Z_n, W_n) \le 0$$

holds.

A sequence $\{W_n\}$ (n = 1, 2, ..., ad inf.) of regions is said to be
an asymptotically most powerful unbiased test of the hypothesis
$\theta = \theta_0$ on the level of significance $\alpha$ if
$P(W_n \mid \theta_0) = \lim G(W_n) = \alpha$ and if for any sequence $\{Z_n\}$ of
regions for which $P(Z_n \mid \theta_0) = \lim G(Z_n) = \alpha$ the inequality
$\limsup L(Z_n, W_n) \le 0$ holds.

Let $P_n(\theta, \alpha)$ be defined by

   $$P_n(\theta, \alpha) = \text{l.u.b.}\; P(Z_n \mid \theta)$$

with respect to all regions $Z_n$ for which $P(Z_n \mid \theta_0) = \alpha$. We will
call $P_n(\theta, \alpha)$ the envelope function corresponding to the level
of significance $\alpha$. Similarly let $P_n^{*}(\theta, \alpha)$ be the least upper
bound of $P(Z_n \mid \theta)$ with respect to all unbiased critical regions
$Z_n$ which have the size $\alpha$. We will call $P_n^{*}(\theta, \alpha)$ the unbiased
envelope function corresponding to the level of significance $\alpha$.
The two previously given definitions are equivalent to the
following two:

A sequence $\{W_n\}$ of regions is said to be an asymptotically
most powerful test of the hypothesis $\theta = \theta_0$ on the level of
significance $\alpha$ if $P(W_n \mid \theta_0) = \alpha$ and

   $$\lim_{n = \infty} \{P_n(\theta, \alpha) - P(W_n \mid \theta)\} = 0$$

uniformly in $\theta$.

A sequence $\{W_n\}$ of regions is said to be an asymptotically
most powerful unbiased test of the hypothesis $\theta = \theta_0$ on the
level of significance $\alpha$ if $P(W_n \mid \theta_0) = \alpha$ and

   $$\lim_{n = \infty} \{P_n^{*}(\theta, \alpha) - P(W_n \mid \theta)\} = 0$$

uniformly in $\theta$.

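Let $\hat{\theta}_n = \hat{\theta}_n(x_1, \ldots, x_n)$ be the maximum likelihood
estimate of $\theta$ in the n-dimensional sample space. That is to say,
$\hat{\theta}_n$ denotes the value of $\theta$ for which the product
$\prod_{\alpha=1}^{n} f(x_\alpha, \theta)$ becomes a maximum. Let $W'_n$ be the
region defined by the inequality $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \ge c'_n$,
$W''_n$ defined by the inequality $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \le c''_n$,
and let $W_n$ be defined by the inequality
$|\sqrt{n}\,(\hat{\theta}_n - \theta_0)| \ge c_n$. The constants $c_n$, $c'_n$, $c''_n$
are chosen in such a way that

   $$P(W_n \mid \theta_0) = P(W'_n \mid \theta_0) = P(W''_n \mid \theta_0) = \alpha.$$

It has been shown that under certain restrictions on the
probability density $f(x, \theta)$ the sequence $\{W'_n\}$ is an asymptotically
most powerful test of the hypothesis $\theta = \theta_0$ if $\theta$ takes only
values $\ge \theta_0$. Similarly $\{W''_n\}$ is an asymptotically most powerful
test if $\theta$ takes only values $\le \theta_0$. Finally $\{W_n\}$ is an
asymptotically most powerful unbiased test if $\theta$ can take any real value.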
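
The three regions can be made concrete in the simplest case. The sketch below assumes, for illustration only, that X is normal with mean $\theta$ and unit variance, so that $\sqrt{n}\,(\hat{\theta}_n - \theta_0)$ is exactly normal with mean $\sqrt{n}\,(\theta - \theta_0)$ and unit variance; the function names and the numerical settings are invented here. It computes the constants and the power functions of $W'_n$, $W''_n$ and $W_n$.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def critical_constants(alpha):
    """Constants giving each region the size alpha (normal, unit-variance case)."""
    c_upper = STANDARD_NORMAL.inv_cdf(1.0 - alpha)            # c'_n for W'_n
    c_lower = -c_upper                                        # c''_n for W''_n
    c_two_sided = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # c_n for W_n
    return c_upper, c_lower, c_two_sided


def power_functions(theta, theta0, n, alpha):
    """Return P(W'_n | theta), P(W''_n | theta), P(W_n | theta)."""
    shift = math.sqrt(n) * (theta - theta0)
    c_upper, c_lower, c_two = critical_constants(alpha)
    upper = 1.0 - STANDARD_NORMAL.cdf(c_upper - shift)
    lower = STANDARD_NORMAL.cdf(c_lower - shift)
    two_sided = (1.0 - STANDARD_NORMAL.cdf(c_two - shift)
                 + STANDARD_NORMAL.cdf(-c_two - shift))
    return upper, lower, two_sided


for theta in (-0.5, 0.0, 0.5):
    print(theta, power_functions(theta, theta0=0.0, n=16, alpha=0.05))
```

At $\theta = \theta_0$ all three powers equal $\alpha$. On the side it is not designed for, a one-sided region has power below $\alpha$, which is why only the two-sided region is a candidate for an unbiased test when $\theta$ can take any real value.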
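There are also other asymptotically most powerful tests.
Let $W'_n$ be the region defined by the inequality

   $$\frac{1}{\sqrt{n}} \sum_{\alpha=1}^{n} \frac{\partial}{\partial \theta} \log f(x_\alpha, \theta_0) \ge c'_n,$$

$W''_n$ defined by the inequality

   $$\frac{1}{\sqrt{n}} \sum_{\alpha=1}^{n} \frac{\partial}{\partial \theta} \log f(x_\alpha, \theta_0) \le c''_n,$$

and $W_n$ defined by the inequality

   $$\left| \frac{1}{\sqrt{n}} \sum_{\alpha=1}^{n} \frac{\partial}{\partial \theta} \log f(x_\alpha, \theta_0) \right| \ge c_n,$$

where the constants $c_n$, $c'_n$ and $c''_n$ are chosen in such a way that

   $$P(W_n \mid \theta_0) = P(W'_n \mid \theta_0) = P(W''_n \mid \theta_0) = \alpha.$$

Then $\{W'_n\}$ is an asymptotically most powerful test of the
hypothesis $\theta = \theta_0$ if $\theta$ takes only values $\ge \theta_0$. Similarly,
$\{W''_n\}$ is an asymptotically most powerful test if $\theta$ takes only
values $\le \theta_0$. Finally $\{W_n\}$ is an asymptotically most powerful
unbiased test if $\theta$ can take any real value.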

The sequence $\{A_n(\theta_0)\}$ is an asymptotically most powerful
unbiased test of the hypothesis $\theta = \theta_0$, where $A_n(\theta_0)$ denotes
the critical region of type A for testing the hypothesis $\theta = \theta_0$.

Since there are many asymptotically most powerful tests,
the question arises whether they are all equally good or
whether one can be preferred to another. It is clear that if
$\{W_n\}$ and $\{W'_n\}$ are two asymptotically most powerful unbiased
tests, then for sufficiently large n they are equally good. In
fact, for sufficiently large n both power functions $P(W_n \mid \theta)$
and $P(W'_n \mid \theta)$ are in a small neighborhood of $P_n^{*}(\theta, \alpha)$.
However, they may behave differently in the sense that with
increasing n one power function, say $P(W_n \mid \theta)$, approaches the
envelope function faster than $P(W'_n \mid \theta)$ does. In such a case it
seems preferable to use $W_n$, especially if the sample is only
moderately large. If the sample is so large that both power
functions are in a small neighborhood of the envelope function,
then it is immaterial whether we use $W_n$ or $W'_n$.

These considerations lead to the idea that it is preferable
to use that asymptotically most powerful (unbiased) test $\{W_n\}$
for which the approach of $P(W_n \mid \theta)$ to the envelope function is,
in a certain sense, fastest.

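A region $W_n$ is called a most stringent test of size $\alpha$ for
testing the hypothesis $\theta = \theta_0$ if $P(W_n \mid \theta_0) = \alpha$ and

   $$\underset{\theta}{\text{l.u.b.}}\, [P_n(\theta, \alpha) - P(W_n \mid \theta)] \le \underset{\theta}{\text{l.u.b.}}\, [P_n(\theta, \alpha) - P(Z_n \mid \theta)]$$

for all $Z_n$ for which $P(Z_n \mid \theta_0) = \alpha$. The abbreviation l.u.b.
with the subscript $\theta$ means "least upper bound with respect to $\theta$."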
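
The quantity compared in this definition can be computed in a simple case. The sketch below again assumes, for illustration only, normal observations with unit variance, where the envelope function has the closed form $\Phi(\sqrt{n}\,|\theta - \theta_0| - z_{1-\alpha})$ because the most powerful test against each single alternative is one-sided. The variable `shift` stands for $\sqrt{n}\,(\theta - \theta_0)$, the grid replaces the least upper bound over all $\theta$ by a maximum over finitely many values, and all names are invented here.

```python
from statistics import NormalDist

Z = NormalDist()


def envelope(shift, alpha):
    """Envelope function P_n(theta, alpha) in the normal unit-variance case."""
    return Z.cdf(abs(shift) - Z.inv_cdf(1.0 - alpha))


def two_sided_power(shift, alpha):
    """Power of the region |sqrt(n) (mean - theta_0)| >= c_n."""
    c = Z.inv_cdf(1.0 - alpha / 2.0)
    return 1.0 - Z.cdf(c - shift) + Z.cdf(-c - shift)


def one_sided_power(shift, alpha):
    """Power of the region sqrt(n) (mean - theta_0) >= c'_n."""
    return 1.0 - Z.cdf(Z.inv_cdf(1.0 - alpha) - shift)


def largest_shortfall(power, alpha, grid):
    """Grid approximation of l.u.b. over theta of [envelope - power]."""
    return max(envelope(s, alpha) - power(s, alpha) for s in grid)


grid = [k / 100.0 for k in range(-800, 801)]
alpha = 0.05
print("two-sided:", largest_shortfall(two_sided_power, alpha, grid))
print("one-sided:", largest_shortfall(one_sided_power, alpha, grid))
```

The two-sided region falls short of the envelope by a moderate amount at worst, while the one-sided region falls short by nearly one for alternatives on the wrong side. In the sense of the definition the two-sided region is therefore much the more stringent of the two; the sketch does not establish that it is the most stringent of all regions of size $\alpha$.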

If $W_n$ is for each n a most stringent test, its power function
will approach the envelope function, in a certain sense,
faster than any other power function. It seems, therefore,
desirable to use a most stringent test. A region of type A is
not exactly a most stringent test, but probably it is quite
near to it (this question has yet to be investigated), and
this would provide a very good justification for the use of a
type A region. The mathematical difficulties in finding
explicitly a most stringent test are considerable.
Let $\delta_n(E_n) = [\underline{\theta}_n(E_n), \bar{\theta}_n(E_n)]$ be an interval function and
denote by $P[\delta_n(E_n)\; C\; \theta' \mid \theta'']$ the probability that $\delta_n(E_n)$ will
cover $\theta'$ under the assumption that $\theta''$ is the true value of the
parameter.

A sequence of interval functions $\{\delta_n(E_n)\}$ (n = 1, 2, ..., ad inf.)
is called an asymptotically shortest confidence interval of $\theta$
if the following two conditions are fulfilled:

(a) $P[\delta_n(E_n)\; C\; \theta \mid \theta] = \alpha$ for all values of $\theta$

(b) For any sequence of interval functions
    $\{\delta'_n(E_n)\}$ (n = 1, 2, ..., ad inf.) which satisfies
    (a), the least upper bound of

       $$P[\delta_n(E_n)\; C\; \theta' \mid \theta''] - P[\delta'_n(E_n)\; C\; \theta' \mid \theta'']$$

    with respect to $\theta'$ and $\theta''$ converges to zero
    with $n \to \infty$.

A sequence of interval functions $\{\delta_n(E_n)\}$ (n = 1, 2, ..., ad inf.)
is called an asymptotically shortest unbiased confidence
interval of $\theta$ if the following three conditions are fulfilled:

(a) $P[\delta_n(E_n)\; C\; \theta \mid \theta] = \alpha$ for all values of $\theta$

(b) The least upper bound of $P[\delta_n(E_n)\; C\; \theta' \mid \theta'']$ with
    respect to $\theta'$ and $\theta''$ converges to $\alpha$ with $n \to \infty$

(c) For any sequence of interval functions $\{\delta'_n(E_n)\}$
    which satisfies the conditions (a) and (b), the
    least upper bound of

       $$P[\delta_n(E_n)\; C\; \theta' \mid \theta''] - P[\delta'_n(E_n)\; C\; \theta' \mid \theta'']$$

    with respect to $\theta'$ and $\theta''$ converges to zero with
    $n \to \infty$.

Let $c_n(\theta)$ be a positive function of $\theta$ such that the
probability that

   $$\left| \frac{1}{\sqrt{n}} \sum_{\beta} \frac{\partial}{\partial \theta} \log f(x_\beta, \theta) \right| \le c_n(\theta)$$

is equal to a constant $\alpha$ under the assumption that $\theta$ is the
true value of the parameter. Denote by $\underline{\theta}(E_n)$ the root in
$\theta$ of the equation

   $$\frac{1}{\sqrt{n}} \sum_{\beta} \frac{\partial}{\partial \theta} \log f(x_\beta, \theta) = c_n(\theta)$$

and by $\bar{\theta}(E_n)$ the root of

   $$\frac{1}{\sqrt{n}} \sum_{\beta} \frac{\partial}{\partial \theta} \log f(x_\beta, \theta) = -c_n(\theta).$$

It has been shown that under some restrictions on $f(x, \theta)$ the
interval $\delta_n(E_n) = [\underline{\theta}(E_n), \bar{\theta}(E_n)]$ is an asymptotically
shortest unbiased confidence interval of $\theta$ corresponding to the
confidence coefficient $\alpha$. This confidence interval is identical
with that given by Wilks 10). The definition of a shortest
confidence interval underlying Wilks' investigations is somewhat
different from that of Neyman, which has been used here.
According to Wilks, a confidence interval $\delta(E)$ is called shortest
in the average if the expectation of the length of $\delta(E)$ is a
minimum. The main result obtained by Wilks can be formulated as
follows: The confidence interval in question is asymptotically
shortest in the average compared with all confidence intervals
the endpoints of which are roots of an equation of the following
type:

   $$\sum_{\beta} h(x_\beta, \theta) = \pm\, c_n(\theta).$$

In the present investigation such a restriction is not made.
The confidence interval in consideration is shown to be
asymptotically shortest compared with any unbiased confidence
interval.

Now let $c_n(\theta)$ be a positive function of $\theta$ such that the
probability that $|\hat{\theta}_n - \theta| \le c_n(\theta)$ is equal to a constant $\alpha$ under


10) See reference 22



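the assumption that $\theta$ is the true value of the parameter.
Denote by $\underline{\theta}(E_n)$ the root in $\theta$ of the equation
$\hat{\theta}_n - \theta = c_n(\theta)$ and by $\bar{\theta}(E_n)$ the root of
$\hat{\theta}_n - \theta = -c_n(\theta)$. Consider the interval
$\delta_n(E_n) = [\underline{\theta}(E_n), \bar{\theta}(E_n)]$. Under some restrictions on the
density $f(x, \theta)$, it can be shown that $\delta_n(E_n)$ is an asymptotically
shortest unbiased confidence interval.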

This is a much stronger property of the maximum likelihood
estimate than its efficiency and gives a justification of
the use of the maximum likelihood estimate also in the light of
Neyman's theory of estimation.



VI OUTLINE OF A GENERAL THEORY OF STATISTICAL INFERENCE


The theories of Fisher, Neyman and Pearson are restricted
in two respects. First, they consider only the problem of
testing a hypothesis and that of estimation by point or
interval. The second restriction is that only the case in which
$\Omega$ is a k-parameter family of distribution functions is
investigated. Both restrictions are serious from the point of
view of applications.

There are many important statistical problems which are
neither problems of testing a hypothesis, nor problems of
estimation. We have already given such an example in Section I.
As a further illustration, let us consider the following case:
Let $X_1, \ldots, X_p$ be p independently and normally distributed
random variables with unit variances and unknown means
$\theta_1, \ldots, \theta_p$. Furthermore, let $x_{i1}, \ldots, x_{in}$ be n independent
observations on $X_i$ (i = 1, 2, ..., p). Suppose we test the
hypothesis that $\theta_1 = \cdots = \theta_p = 0$, and decide to reject this
hypothesis on the basis of the pn observations $x_{i\alpha}$
($\alpha$ = 1, 2, ..., n; i = 1, 2, ..., p). In such cases we are
usually interested in knowing which mean values are not zero,
i.e., we wish to subdivide the set of p mean values
$\theta_1, \ldots, \theta_p$ into two subsets, such that one of them
contains the mean values which are zero and the other the mean
values which are not zero. This subdivision has to be done, of
course, on the basis of the pn observations $x_{i\alpha}$. More
precisely, we have to deal with the following statistical problem:
There exist $2^p$ different subsets of the set $(\theta_1, \ldots, \theta_p)$.
Denote these subsets by $\omega_1, \ldots, \omega_{2^p}$ respectively. Let
$H_{\omega_k}$ (k = 1, ..., $2^p$) be the hypothesis that the mean values
contained in the set $\omega_k$ are equal to zero and all other mean
values are unequal to zero. On the basis of the pn observations
we have to decide which hypothesis from the set of the $2^p$
possible hypotheses should be accepted. This problem cannot be
considered as a problem of testing a hypothesis nor a problem of
estimation.
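
The size of this decision problem is easy to make concrete. The sketch below, with names chosen here for illustration, lists the $2^p$ hypotheses for a small p, each hypothesis being identified with the subset of indices whose mean values it asserts to be zero.

```python
from itertools import combinations


def enumerate_hypotheses(p):
    """Return all 2**p subsets of {1, ..., p}; each subset lists the zero means."""
    indices = range(1, p + 1)
    return [frozenset(subset)
            for size in range(p + 1)
            for subset in combinations(indices, size)]


p = 3
hypotheses = enumerate_hypotheses(p)
for zero_means in hypotheses:
    nonzero_means = sorted(set(range(1, p + 1)) - zero_means)
    print(f"zero: {sorted(zero_means)}   non-zero: {nonzero_means}")
```

A rule for this problem has to pick exactly one of these subsets for every possible sample, which is more than the accept-or-reject choice of a test of a single hypothesis.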

A similar problem arises if we wish to classify a set of
regression coefficients into the class of non-zero and the
class of zero regression coefficients. In problems of regression
we often take it for granted that the regression in question
is a polynomial and we have to determine on the basis of
the observations the degree of the polynomial to be fitted.
That is to say, we have to decide on the basis of the
observations which hypothesis of the sequence of hypotheses
$H_1, H_2, H_3, \ldots, H_n, \ldots$ should be accepted. The symbol $H_n$
(n = 1, 2, ...) denotes the hypothesis that the regression is a
polynomial of n-th degree. These examples illustrate sufficiently
the necessity of the extension of the theory of statistical
inference to the general case as formulated in Section I.

The case in which $\Omega$ cannot be represented as a k-parameter
family of distribution functions is quite important. As an
illustration, consider the following problem: Let $(x_1, y_1), \ldots,
(x_n, y_n)$ be n independent pairs of observations on a pair (X, Y)
of random variables. Suppose we wish to test the hypothesis
that X and Y are independently distributed and we do not have
any a priori knowledge about the joint distribution of X and Y.
In this case $\Omega$ consists of all distribution functions
$F(x_1, y_1, \ldots, x_n, y_n)$ which can be written in the form

   $$F(x_1, y_1, \ldots, x_n, y_n) = \Phi(x_1, y_1) \cdots \Phi(x_n, y_n)$$

where $\Phi$ may be an arbitrary function. The subclass $\omega$ consists
of all distribution functions $F(x_1, y_1, \ldots, x_n, y_n)$ which can be
written in the form

   $$F(x_1, y_1, \ldots, x_n, y_n) = \varphi(x_1)\psi(y_1)\, \varphi(x_2)\psi(y_2) \cdots \varphi(x_n)\psi(y_n).$$

Hence, $\Omega$ cannot be represented as a k-parameter family of
functions.

The problem given above as an illustration has been treated
by H. Hotelling and Margaret Pabst (see reference 8). Another
problem, where $\Omega$ is the class of all continuous
distributions, has been considered in a paper (see reference 21).
We shall give here an outline of a theory of statistical inference
dealing with the following general problem 11):

Let $X_1, \ldots, X_n$ be a set of n random variables. It is known
that the joint probability distribution function $F(x_1, \ldots, x_n)$
of $X_1, \ldots, X_n$ is an element of a certain class $\Omega$ of
distribution functions. Let S be a system of subclasses of $\Omega$. For
each element $\omega$ of S denote by $H_\omega$ the hypothesis that the true
distribution $F(x_1, \ldots, x_n)$ of $X_1, \ldots, X_n$ is an element of $\omega$.
Denote by $H_S$ the system of all hypotheses corresponding to all
elements of S. Let $x_i$ be the observed value of $X_i$ (i = 1, ..., n).
We have to decide by means of the observed sample point
$E_n = (x_1, \ldots, x_n)$ which hypothesis of the system $H_S$ of
hypotheses should be accepted. That is to say, for each hypothesis
$H_\omega$ we have to determine a region of acceptance $M_\omega$ in the
n-dimensional sample space. The hypothesis $H_\omega$ will be accepted

11) This theory has been developed in reference 16
for the case that $\Omega$ is a k-parameter family
if and only if the sample point falls in the region $M_\omega$. The
regions $M_\omega$ and $M_{\omega'}$ are, of course, disjoint for $\omega \ne \omega'$.
Furthermore, $\sum_{\omega} M_\omega$ is equal to the whole sample space. The
statistical problem is that of the proper choice of the system
$M_S$ of the regions of acceptance.

The choice of the system $M_S$ of regions of acceptance is
equivalent to the choice of a function $\omega(E_n)$ defined over all
points $E_n$ of the sample space. The value of the function
$\omega(E_n)$ is an element of S determined as follows: Since the
elements of $M_S$ are disjoint and since $\sum_{\omega} M_\omega$ is equal to the
whole sample space, for each point $E_n$ there exists exactly one
element $\omega$ of S such that $E_n$ is contained in $M_\omega$. The value of the
function $\omega(E_n)$ is that element $\omega$ of S for which $E_n$ is an
element of $M_\omega$. Hence, we can replace $M_S$ by the function $\omega(E_n)$
and for each sample point $E_n$ we decide to accept the hypothesis
$H_{\omega(E_n)}$. We will call $\omega(E_n)$ the statistical decision function.
Hence, the statistical problem is that of choosing the
statistical decision function $\omega(E_n)$.

The choice of $\omega(E_n)$ will essentially be affected by the
relative importance of the different possible errors we may
commit. We commit an error whenever we accept a hypothesis $H_\omega$
and the true distribution is not an element of $\omega$. We introduce
a weight function for the possible errors. The weight function
$w[F, \omega]$ is a real valued non-negative function defined for all
elements F of $\Omega$ and all elements $\omega$ of S, expressing the
relative importance of the error committed by accepting $H_\omega$ when
F is true. If F is an element of $\omega$ then $w[F, \omega] = 0$, otherwise
$w[F, \omega] > 0$. The question as to how the form of the weight
function $w[F, \omega]$ should be chosen is not a mathematical nor
statistical one. The statistician who wants to test certain
hypotheses must first determine the relative importance of all
possible errors and this will depend on the special purposes of
his investigation. If this is done, we shall in general be able to
give a more satisfactory answer to the question as to how the
statistical decision function should be chosen. In many cases,
especially in statistical questions concerning industrial
production, we are able to express the importance of an error in
monetary terms, that is, we can express the loss caused by the
error considered in terms of money. We shall also say that
$w[F, \omega]$ is the loss caused by accepting $H_\omega$ when F is true.

Suppose that we make our decisions according to a
statistical decision function $\omega(E_n)$, and that the true distribution
is the element $F(x_1, \ldots, x_n)$ of $\Omega$. Then the expected value of
the loss is obviously given by the Stieltjes integral

(5)   $$\int w[F, \omega(E_n)]\, dF(x_1, \ldots, x_n) = r[F],$$

where the integration is to be taken over the whole sample space.
We shall call the expression (5) the risk of accepting a
false hypothesis when F is the true distribution function.
Since we do not know the true distribution F we shall have to
study the risk $r[F]$ as a function of F. We shall call this
function the risk function. Hence, the risk function is defined
over all elements F of $\Omega$. The form of the risk function
depends on the statistical decision function $\omega(E_n)$ and on the
weight function $w[F, \omega]$. In order to express this fact, we
shall denote the risk function associated with the statistical
decision function $\omega(E_n)$ and the weight function $w[F, \omega]$ also by

   $$r\{F \mid \omega(E_n),\, w[F, \omega]\}.$$

We introduce the following definitions:

Definition 1. Denote by $\omega(E_n)$ and $\omega'(E_n)$ two statistical
decision functions for the same system $H_S$ of hypotheses. We
shall say that $\omega(E_n)$ and $\omega'(E_n)$ are equivalent relative to the
weight function $w[F, \omega]$ if the risk function $r\{F \mid \omega(E_n), w[F, \omega]\}$
is identically equal to the risk function $r\{F \mid \omega'(E_n), w[F, \omega]\}$,
i.e., for any element F of $\Omega$ we have

   $$r\{F \mid \omega(E_n),\, w[F, \omega]\} = r\{F \mid \omega'(E_n),\, w[F, \omega]\}.$$

Definition 2. Denote by $\omega(E_n)$ and $\omega'(E_n)$ two statistical
decision functions for the same system $H_S$ of hypotheses. We
shall say that $\omega(E_n)$ is uniformly better than $\omega'(E_n)$ relative
to the weight function $w[F, \omega]$ if $\omega(E_n)$ and $\omega'(E_n)$ are not
equivalent and for each element F of $\Omega$ we have

   $$r\{F \mid \omega(E_n),\, w[F, \omega]\} \le r\{F \mid \omega'(E_n),\, w[F, \omega]\}.$$

Definition 3. A statistical decision function $\omega(E_n)$ is
said to be admissible relative to the weight function $w[F, \omega]$
if no uniformly better statistical decision function exists
relative to the weight function considered.

First principle for the choice of the statistical decision
function. We choose a statistical decision function which is
admissible relative to the weight function considered.

There can scarcely be given any argument against the
acceptance of the above principle for the selection of $\omega(E_n)$.
However, this principle does not lead in general to a unique
solution. There exist in general many admissible statistical
decision functions. We need a second principle for the choice
of a best admissible decision function.
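
Definitions 1 to 3 can be traced on a small finite problem. The example below is invented for illustration and is not taken from the text: $\Omega$ consists of two binomial distributions for the number of successes in 3 trials (success probabilities 0.25 and 0.75), S consists of the two one-element subclasses, and the weight is 1 for accepting the wrong hypothesis and 0 otherwise. Every decision function is a tuple giving the accepted hypothesis for each possible observation 0, 1, 2, 3, so all 16 of them can be listed and compared.

```python
from itertools import product
from math import comb

P_VALUES = (0.25, 0.75)  # Omega = {F_1, F_2}: binomial(3, p) distributions (assumed example)
N_TRIALS = 3


def probability(x, p):
    """Binomial probability of x successes in N_TRIALS trials."""
    return comb(N_TRIALS, x) * p**x * (1 - p)**(N_TRIALS - x)


def risk(decision, true_index):
    """Expected loss when F_{true_index} is true; a wrong acceptance has weight 1."""
    p = P_VALUES[true_index]
    return sum(probability(x, p)
               for x in range(N_TRIALS + 1)
               if decision[x] != true_index)


def risk_vector(decision):
    return tuple(risk(decision, i) for i in range(len(P_VALUES)))


def uniformly_better(a, b):
    """Definition 2: a is not equivalent to b and its risk is nowhere larger."""
    ra, rb = risk_vector(a), risk_vector(b)
    return ra != rb and all(x <= y for x, y in zip(ra, rb))


# decision[x] is the index of the hypothesis accepted when x successes are observed
decisions = list(product((0, 1), repeat=N_TRIALS + 1))
admissible = [d for d in decisions
              if not any(uniformly_better(other, d) for other in decisions)]
for d in admissible:
    print(d, risk_vector(d))
```

Several decision functions survive, among them the two that always accept the same hypothesis. The first principle alone therefore does not single out a rule, as the text remarks.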
The choice between two admissible decision functions $\omega(E_n)$
and $\omega'(E_n)$ may be affected by the degree of our a priori
confidence in the truth of the different elements of $\Omega$. Suppose,
for instance, that for a certain element $F_1$ of $\Omega$ we have

   $$r\{F_1 \mid \omega(E_n),\, w[F, \omega]\} < r\{F_1 \mid \omega'(E_n),\, w[F, \omega]\},$$

for another element $F_2$ of $\Omega$ we have

   $$r\{F_2 \mid \omega(E_n),\, w[F, \omega]\} > r\{F_2 \mid \omega'(E_n),\, w[F, \omega]\}$$

and for any other element $F \ne F_1, F_2$ we have

   $$r\{F \mid \omega(E_n),\, w[F, \omega]\} = r\{F \mid \omega'(E_n),\, w[F, \omega]\}.$$

If we have much greater a priori confidence in the truth of $F_1$
than in that of $F_2$, we will probably prefer $\omega(E_n)$ to $\omega'(E_n)$.
On the other hand, if we think a priori that $F_2$ is more likely
to be true than $F_1$, we may prefer $\omega'(E_n)$ to $\omega(E_n)$.

Suppose we can express our a priori degree of confidence
by a non-negative additive set function $\rho(\eta)$ defined over a
certain system of subsets of $\Omega$, where $\rho(\Omega) = 1$. That is to say,
the value of $\rho(\eta)$ expresses the degree of our a priori belief
that the true distribution is an element of the subset $\eta$. In
such a case it seems very reasonable to consider a decision
function $\omega^{*}(E_n)$ as "best" if the value of the integral

   $$\int_{\Omega} r\{F \mid \omega(E_n),\, w[F, \omega]\}\, d\rho$$

becomes a minimum for $\omega(E_n) = \omega^{*}(E_n)$. That is, we consider a
decision function $\omega^{*}(E_n)$ as "best" if it minimizes a certain
weighted average of the risk function.

However, it is doubtful that a set function expressing our
a priori degree of belief can meaningfully be constructed.
Therefore, we prefer to formulate the notion of a "best"
decision function independently of such considerations.



Denote by $r[\omega(E_n), w]$ the least upper bound of
$r[P \mid \omega(E_n), w(P,\omega)]$ with respect to $P$, where $P$
may be any element of $\Omega$.

Definition 4. A decision function $\omega^*(E_n)$ is said to be a
"best" decision function if $r[\omega(E_n), w]$ becomes a minimum
for $\omega(E_n) = \omega^*(E_n)$. (The weight function is considered
fixed.)
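
Definition 4 can be sketched for a finite $\Omega$, where the least upper bound of the risk is simply a maximum. The risk table and the function names below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical risk table over a finite Omega (illustrative numbers only).
risks: Dict[str, List[float]] = {
    "omega_a": [0.10, 0.40, 0.30],
    "omega_b": [0.40, 0.10, 0.30],
    "omega_d": [0.25, 0.25, 0.25],
}


def max_risk(r: List[float]) -> float:
    """Least upper bound of the risk over Omega (a plain max when Omega is finite)."""
    return max(r)


def best_minimax(table: Dict[str, List[float]]) -> str:
    """Decision function whose maximum risk is smallest (definition 4, finite case)."""
    return min(table, key=lambda name: max_risk(table[name]))


choice = best_minimax(risks)
print(choice, max_risk(risks[choice]))
```

Unlike the weighted-average criterion, this selection requires no a priori weights on the elements of $\Omega$; only the risk values themselves enter.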

This definition of a "best" decision function seems to be
a very reasonable one, although it is not the only possible one.
One could reasonably define a decision function as "best" if it
minimizes a certain weighted average of the risk function.
However, there are certain properties of the "best" decision
function according to definition 4 which seem to justify the
use of that definition. One of the most important properties
of a "best" decision function in the sense of definition 4 is
that the risk function is a constant, i.e., it has the same
value for all elements $P$ of $\Omega$. This has been shown in the
case that $\Omega$ is a k-parameter family of distributions, and the
weight function $w[P, \omega]$ and the distribution functions $P$
satisfy certain restrictive conditions. The constancy of the risk
function seems to be very desirable from the point of view of
applications, since this property makes it possible to evaluate
the exact magnitude of the risk associated with the statistical
decision. In the theory of confidence intervals the confidence
coefficient $\alpha$, i.e., the probability that the confidence
interval will cover the unknown parameter, is independent of the
value of the unknown parameter. This fact, which is considered
to be of basic importance in the theory of interval estimation,
is analogous to the constancy of the risk function in our general
theory, since $1-\alpha$ can be considered in a certain sense as
the risk associated with the interval estimation. (The quantity
$1-\alpha$ is exactly equal to the risk in the sense of our
definition, if the weight function takes only the values 0 and 1.)
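
The parenthetical remark can be written out as a short derivation. It is a sketch under two assumptions not restated in this passage: that the risk is the expected value of the weight under the true distribution $P$, and that $\omega(E_n)$ denotes the set of distributions whose parameter lies in the confidence interval computed from $E_n$.

```latex
% Assumptions: risk = expected weight under the true P; \omega(E_n) is the
% set of distributions whose parameter lies in the interval computed from E_n.
\begin{align*}
w[P,\omega] &=
  \begin{cases}
    0, & P \in \omega,\\
    1, & P \notin \omega,
  \end{cases}\\[4pt]
r[P \mid \omega(E_n), w(P,\omega)]
  &= \int w[P, \omega(E_n)] \, dP(E_n)\\
  &= \Pr\{\, P \notin \omega(E_n) \mid P \,\}\\
  &= 1 - \alpha \qquad \text{for every } P \in \Omega .
\end{align*}
```

The last line uses only the defining property of the confidence coefficient, namely that the covering probability $\alpha$ does not depend on the true parameter.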

Finally, I should like to make some remarks about the
relationship of the general theory as outlined here to the
particular theory of uniformly most powerful and asymptotically
most powerful tests which were discussed before. In the case of
testing the simple hypothesis that the unknown distribution
$P(x_1, \ldots, x_n)$ is equal to a particular distribution
$P_0(x_1, \ldots, x_n)$, the system $S$ of subsets of $\Omega$ consists
only of two elements $\omega_1$ and $\omega_2$, where $\omega_1$ contains
the single element $P_0$ and $\omega_2$ is the complement of $\omega_1$
in $\Omega$. Hence, the decision function $\omega(E_n)$ can assume
merely the values $\omega_1$ and $\omega_2$. Let $M_{\omega_1}$ be the
subset of the sample space consisting of the points $E_n$ for which
$\omega(E_n) = \omega_1$, and let $M_{\omega_2}$ be the set of points
$E_n$ for which $\omega(E_n) = \omega_2$. The set $M_{\omega_2}$ is the
complement of $M_{\omega_1}$ in the sample space. Obviously the set
$M_{\omega_2}$ is the critical region in the sense of the
Neyman-Pearson theory. It is easy to see that if for any $\alpha$
$(0 < \alpha < 1)$ a uniformly best critical region of size $\alpha$
for testing $P = P_0$ exists, then for any arbitrary weight function
and for any admissible (see definition 3) decision function
$\omega(E_n)$, the set $M_{\omega_2}$ will be a uniformly best critical
region. In particular, the set $M_{\omega_2}$ corresponding to the
"best" decision function (see definition 4) will be a uniformly
best critical region. Hence, the form of the weight function
affects merely the size of the region $M_{\omega_2}$ associated with
the "best" decision function, but it will always be a uniformly
best critical region in the sense of the Neyman-Pearson theory.
Similar considerations hold concerning asymptotically most
powerful tests. Let the sequence $\{W_n\}$
$(n = 1, 2, \ldots, \text{ad inf.})$ of critical regions be an
asymptotically most powerful test for testing the simple
hypothesis $P = P_0$. Then for sufficiently large $n$ the region
$W_n$ is practically a uniformly best critical region and,
therefore, it will be an excellent approximation to the region
which is "best" in the sense of definition 4, irrespective of the
shape of the weight function of errors.

As we have seen, for building up a general theory of
statistical inference, the following three steps have to be
made:

1. Formulation of the general problem of statistical
inference.

2. Definition of the "best" procedure for making
statistical decisions, i.e., definition of the "best"
statistical decision function.

3. Solution of the mathematical problem of calculating
the "best" statistical decision function.

The problem of statistical inference, as we have formulated
it here, seems to be sufficiently broad to cover the problems in
practical applications. The second step will always be, to a
certain extent, arbitrary. The definition of "best" decision
function given here seems to be a satisfactory one. Moreover,
under certain restrictive conditions it has the important
property that the risk function associated with the "best"
decision function is constant, i.e., it has the same value for
all elements of $\Omega$. However, there may be other definitions
of a "best" decision function worth investigating. Decision
functions which minimize a certain average of the risk function
may be of special interest. Concerning step 3, there are many
mathematical problems as yet unsolved.